107 research outputs found

    Towards Socially Responsible AI: Cognitive Bias-Aware Multi-Objective Learning

    Human society has a long history of suffering from cognitive biases that lead to social prejudice and mass injustice. The prevalence of cognitive biases in large volumes of historical data poses the threat that these biases will be manifested as unethical and seemingly inhuman predictions by AI systems trained on such data. To alleviate this problem, we propose a bias-aware multi-objective learning framework that, given a set of identity attributes (e.g. gender, ethnicity) and a subset of sensitive categories among the possible prediction classes, learns to reduce the frequency of predicting certain combinations of them, e.g. stereotypes such as 'most blacks use abusive language' or 'fear is a virtue of women'. Our experiments on an emotion prediction task with balanced class priors show that a set of baseline bias-agnostic models exhibits cognitive biases with respect to gender, such as predicting that women are prone to be afraid whereas men are more prone to be angry. In contrast, our proposed bias-aware multi-objective learning methodology is shown to reduce such biases in the predicted emotions.
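
    A minimal sketch of how such a bias-aware objective could be set up, assuming a PyTorch classifier; the penalty term, the identity_mask/sensitive_classes inputs, and the weight lam are illustrative assumptions rather than the paper's exact formulation:

        import torch
        import torch.nn.functional as F

        def bias_aware_loss(logits, labels, identity_mask, sensitive_classes, lam=0.1):
            """Cross-entropy plus a penalty on the probability mass assigned to
            sensitive classes for instances carrying a given identity attribute
            (illustrative multi-objective formulation)."""
            ce = F.cross_entropy(logits, labels)
            probs = F.softmax(logits, dim=-1)                          # (batch, n_classes)
            sensitive_mass = probs[:, sensitive_classes].sum(dim=-1)   # (batch,)
            bias_penalty = (sensitive_mass * identity_mask.float()).mean()
            return ce + lam * bias_penalty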

    Exploring sentence level query expansion in language modeling based information retrieval

    We introduce two novel methods for query expansion in information retrieval (IR). The basis of these methods is to add the most similar sentences extracted from pseudo-relevant documents to the original query. The first method adds a fixed number of sentences to the original query, the second a progressively decreasing number of sentences. We evaluate these methods on the English and Bengali test collections from the FIRE workshops. The major findings of this study are that: i) performance is similar for both English and Bengali; ii) employing a smaller context (similar sentences) yields a considerably higher mean average precision (MAP) compared to extracting terms from full documents (up to 5.9% improvement in MAP for English and 10.7% for Bengali compared to standard blind relevance feedback (BRF)); iii) using a variable number of sentences for query expansion performs better and shows less variance in the best MAP for different parameter settings; iv) query expansion based on sentences can improve performance even for topics with low initial retrieval precision where standard BRF fails.
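
    The fixed-number variant might look roughly like the sketch below, using simple term-frequency cosine similarity to rank candidate sentences; the similarity measure, sentence splitter, and function names are assumptions, not the paper's exact setup:

        import math
        import re
        from collections import Counter

        def _tf_vector(text):
            return Counter(re.findall(r"\w+", text.lower()))

        def _cosine(a, b):
            num = sum(a[t] * b[t] for t in set(a) & set(b))
            den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
            return num / den if den else 0.0

        def expand_query(query, pseudo_relevant_docs, n_sentences=3):
            """Append the n sentences most similar to the query, drawn from the
            pseudo-relevant documents, to the original query (illustrative)."""
            qvec = _tf_vector(query)
            sentences = [s for doc in pseudo_relevant_docs
                         for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
            ranked = sorted(sentences, key=lambda s: _cosine(qvec, _tf_vector(s)), reverse=True)
            return query + " " + " ".join(ranked[:n_sentences])

    The variable-number variant described in the abstract would presumably lower n_sentences progressively rather than keeping it fixed.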

    DCU@FIRE2010: term conflation, blind relevance feedback, and cross-language IR with manual and automatic query translation

    For the first participation of Dublin City University (DCU) in the FIRE 2010 evaluation campaign, information retrieval (IR) experiments on English, Bengali, Hindi, and Marathi documents were performed to investigate term conflation (different stemming approaches and indexing word prefixes), blind relevance feedback, and manual and automatic query translation. The experiments are based on BM25 and on language modeling (LM) for IR. Results show that term conflation always improves mean average precision (MAP) compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments indexing 5-prefixes outperforms our corpus-based stemmer, while in Hindi the corpus-based stemmer achieves a higher MAP. For Bengali, the LM retrieval model achieves a much higher MAP than BM25 (0.4944 vs. 0.4526). In all experiments using BM25, blind relevance feedback yields considerably higher MAP in comparison to experiments without it. Bilingual IR experiments (English→Bengali and English→Hindi) are based on query translations obtained from native speakers and the Google Translate web service. For the automatically translated queries, MAP is slightly (but not significantly) lower compared to experiments with manual query translations. The bilingual English→Bengali (English→Hindi) experiments achieve 81.7%-83.3% (78.0%-80.6%) of the best corresponding monolingual experiments.
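
    As an illustration of the prefix-indexing form of term conflation used in these runs, an n-prefix conflator can be as simple as the sketch below (the function name and example terms are ours; the 5-prefix setting mirrors the Marathi runs):

        def conflate(term, prefix_len=5):
            """Crude term conflation by truncating a word to its first prefix_len
            characters; both the index and the query use the conflated forms."""
            return term if len(term) <= prefix_len else term[:prefix_len]

        # Inflectional variants sharing a prefix collapse to one index term.
        print(conflate("retrieving"), conflate("retrieval"))  # 'retri' 'retri'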

    Simulation of within-session query variations using a text segmentation approach

    We propose a generative model for automatic query reformulations from an initial query using the underlying subtopic structure of top-ranked retrieved documents. We address three types of query reformulation: a) specialization; b) generalization; and c) drift. To test our model, we generate the three reformulation variants starting with selected fields from the TREC-8 topics as the initial queries. We use manual judgments from multiple assessors to calculate the accuracy of the reformulated query variants and observe accuracies of 65%, 82%, and 69% for specialization, generalization, and drift reformulations, respectively.
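
    The three reformulation types can be pictured schematically as below; the paper's model is generative and driven by the subtopic structure of the top-ranked documents, whereas this sketch only mirrors the intended effect of each type with random term selection (all names and parameters are assumptions):

        import random

        def reformulate(query_terms, subtopic_terms, kind="specialization", k=2):
            """Toy illustration of specialization, generalization, and drift,
            assuming subtopic_terms come from segmenting top-ranked documents."""
            k = min(k, len(subtopic_terms))
            if kind == "specialization":   # narrow: keep the query, add subtopic terms
                return query_terms + random.sample(subtopic_terms, k)
            if kind == "generalization":   # broaden: drop some of the original terms
                return random.sample(query_terms, max(1, len(query_terms) - k))
            if kind == "drift":            # drift: swap part of the query for subtopic terms
                kept = random.sample(query_terms, max(1, len(query_terms) - k))
                return kept + random.sample(subtopic_terms, k)
            raise ValueError(kind)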

    DCU and ISI@INEX 2010: Ad-hoc and data-centric tracks

    We describe the participation of Dublin City University (DCU) and the Indian Statistical Institute (ISI) in INEX 2010. The main contributions of this paper are: i) a simplified version of the Hierarchical Language Model (HLM), which scores an XML element with a combined probability of generating the given query from the element itself and the top-level article node, is shown to outperform the baselines of Language Model (LM) and Vector Space Model (VSM) scoring of XML elements; ii) Expectation Maximization (EM) feedback in LM is shown to be the most effective on the domain-specific IMDB collection; iii) automated removal of sentences indicating aspects of irrelevance from the narratives of INEX ad-hoc topics is shown to improve retrieval effectiveness.
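
    A hedged sketch of the element-scoring idea in contribution i): the query-generation probability of an element is combined with that of its enclosing article and backed off to a collection model; the mixture weights and smoothing scheme below are assumptions, not the exact HLM formulation:

        import math

        def hlm_score(query_terms, element_lm, article_lm, collection_lm, mu=0.6, lam=0.7):
            """Score an XML element by mixing its own language model with the
            article-level model, smoothed with the collection model
            (illustrative simplified hierarchical LM)."""
            score = 0.0
            for t in query_terms:
                p_elem = element_lm.get(t, 0.0)
                p_art = article_lm.get(t, 0.0)
                p_coll = collection_lm.get(t, 1e-9)
                p = lam * (mu * p_elem + (1 - mu) * p_art) + (1 - lam) * p_coll
                score += math.log(p)
            return score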

    Utilizing sub-topical structure of documents for information retrieval.

    Text segmentation in natural language processing typically refers to the process of decomposing a document into constituent subtopics. Our work centers on the application of text segmentation techniques within information retrieval (IR) tasks, for example: scoring a document by combining the retrieval scores of its constituent segments; exploiting the proximity of query terms in documents for ad-hoc search; and question answering (QA), where retrieved passages from multiple documents are aggregated and presented as a single document to a searcher. Feedback in the ad-hoc IR task is shown to benefit from using sentences extracted from the pseudo-relevant documents, rather than terms, for query expansion. Retrieval effectiveness for the patent prior art search task is enhanced by applying text segmentation to the patent queries. Another aspect of our work involves augmenting text segmentation techniques to produce segments which are more readable, with less unresolved anaphora. This is particularly useful for QA and snippet generation tasks, where the objective is to aggregate relevant and novel information from multiple documents satisfying a user's information need, while ensuring that the automatically generated content presented to the user is easily readable without reference to the original source documents.
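
    For the first application above, combining segment-level retrieval scores into a document score could look like the following sketch; the max/mean mixture and the alpha weight are illustrative assumptions, not the thesis's exact model:

        def combine_segment_scores(segment_scores, alpha=0.5):
            """Score a document from the retrieval scores of its segments by
            weighting the best-matching segment against the segment average."""
            if not segment_scores:
                return 0.0
            return alpha * max(segment_scores) + (1 - alpha) * sum(segment_scores) / len(segment_scores)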

    Segmenting and summarizing general events in a long-term lifelog

    Lifelogging aims to capture a person's life experiences using digital devices. When captured over an extended period of time, a lifelog can potentially contain millions of files from various sources in a range of formats. For lifelogs containing such massive numbers of items, we believe it is important to group them into meaningful sets and summarize them, so that users can search and browse their lifelog data efficiently. Existing studies have explored the segmentation of continuously captured images over short periods of at most a few days into small groups of “events” (episodes). Yet, for long-term lifelogs, higher levels of abstraction are desirable due to the very large number of “events” which will occur over an extended period. We aim to segment a long-term lifelog at the level of general events, which typically extend beyond a daily boundary, and to select summary information to represent these events. We describe our current work on higher-level segmentation and summary information extraction for long-term lifelogs and report a preliminary pilot study on a real long-term lifelog collection.

    United we fall, divided we stand: A study of query segmentation and PRF for patent prior art search

    Previous research in patent search has shown that reducing queries by extracting a few key terms is ineffective, primarily because of the vocabulary mismatch between patent applications used as queries and existing patent documents. This finding has led to the use of full patent applications as queries in patent prior art search. In addition, standard information retrieval (IR) techniques such as query expansion (QE) do not work effectively with patent queries, principally because of the presence of noise terms in the massive queries. In this study, we take a new approach to QE for patent search. Text segmentation is used to decompose a patent query into self-coherent sub-topic blocks. Each of these much shorter sub-topic blocks, which is representative of a specific aspect or facet of the invention, is then used as a query to retrieve documents. Documents retrieved using the different resulting sub-queries or query streams are interleaved to construct a final ranked list. This technique can exploit the potential benefit of QE since the segmented queries are generally more focused and less ambiguous than the full patent query. Experiments on the CLEF-2010 IP prior-art search task show that the proposed method outperforms the retrieval effectiveness achieved when using a single full patent application text as the query, and also demonstrates the potential benefits of QE to alleviate the vocabulary mismatch problem in patent search.
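
    A minimal sketch of the interleaving step, assuming each sub-topic block has already been run as a separate query to produce a ranked list of document identifiers; a simple round-robin merge is shown, though the paper may weight or merge the query streams differently:

        def interleave(ranked_lists, depth=1000):
            """Round-robin interleaving of the ranked lists retrieved for the
            sub-topic blocks of a segmented patent query (illustrative)."""
            seen, fused = set(), []
            for rank in range(depth):
                for results in ranked_lists:
                    if rank < len(results) and results[rank] not in seen:
                        seen.add(results[rank])
                        fused.append(results[rank])
            return fused[:depth]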

    Enhanced information retrieval using domain-specific recommender models

    The objective of an information retrieval (IR) system is to retrieve relevant items which meet a user's information need. There is currently significant interest in personalized IR, which seeks to improve IR effectiveness by incorporating a model of the user's interests. However, in some situations there may be no opportunity to learn about the interests of a specific user on a certain topic. In our work, we propose an IR approach which combines a recommender algorithm with IR methods to improve retrieval in domains where the system has had no opportunity to learn about the user's knowledge, because they have not previously entered a query in that domain. We use search data from previous users interested in the same topic to build a recommender model for that topic. When a user enters a query on a topic that is new to them, an appropriate recommender model is selected and used to predict a ranking which the user may find interesting, based on the behaviour of previous users with similar queries. The recommender output is integrated with a standard IR method in a weighted linear combination to provide a final result for the user. Experiments using the INEX 2009 data collection with a simulated recommender training set show that our approach can improve on a baseline IR system.
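
    The weighted linear combination of the recommender output and the standard IR scores might be realised as in the sketch below; the min-max normalisation, the default weight, and the dictionary inputs are assumptions:

        def fuse_scores(ir_scores, rec_scores, weight=0.7):
            """Combine per-document IR scores and topic-recommender scores with a
            weighted linear combination after min-max normalisation
            (both inputs are assumed to be non-empty dicts of doc_id -> score)."""
            def norm(scores):
                lo, hi = min(scores.values()), max(scores.values())
                span = (hi - lo) or 1.0
                return {d: (s - lo) / span for d, s in scores.items()}
            ir_n, rec_n = norm(ir_scores), norm(rec_scores)
            docs = set(ir_n) | set(rec_n)
            return {d: weight * ir_n.get(d, 0.0) + (1 - weight) * rec_n.get(d, 0.0) for d in docs}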